enable import from GCS emulator without PublicHost
#248
+136
−0
fixes #209
Summary of problem:
There is an issue with the job that imports files from GCS, specifically when using the GCS Emulator. As detailed in issue #209, attempts to import data from the GCS Emulator sometimes fail.
This happens when `publicHost` is not set in the GCS Emulator, or when the emulator is accessed through a host other than `publicHost`.
We have spent quite some time investigating this issue, and since an issue with comments on it already exists, we believe there is value in making the import work without needing to set a publicHost.
Cause:
The problem arises due to two different URL formats used for accessing objects in the GCS Emulator:
The second URL pattern is only valid for accesses to publicHost in the GCS Emulator. The Go GCS SDK, when downloading files from GCS (using `client.Bucket(...).Object(...).NewReader()`), accesses the latter URL format, which requires a valid publicHost and results in errors if it's not set.
The issue can be pinpointed in the code here:
When building the URL for data reading, the method at google-cloud-go#L788-L793 is used. This method does not take the API prefix (`storage/v1`) into account, considering only the host, bucket name, and object path. It is internally used in the NewReader method at bigquery-emulator#L1087.
However, this problem does not occur with the JSON API, because it uses the former URL format even when reading data (google-api-go-client#L12441).
This issue seems to be specific to the Emulator and not a problem with standard GCS usage, likely due to the ability to access objects directly through URLs without an API Prefix on storage.googleapis.com.
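To make the difference between the two URL shapes concrete, here is a minimal sketch using only the Go standard library. The helper names `directObjectURL` and `jsonAPIObjectURL`, the host `localhost:4443`, and the bucket and object names are hypothetical. The JSON API path shape (`/storage/v1/b/<bucket>/o/<object>?alt=media`) is an assumption based on the public JSON API, not code copied from the SDK.

```go
package main

import (
	"fmt"
	"net/url"
)

// directObjectURL builds a URL from host, bucket and object path only,
// with no API prefix. This mirrors the behaviour described above for
// the default read path (hypothetical helper, not the SDK's own code).
func directObjectURL(host, bucket, object string) string {
	u := url.URL{Scheme: "http", Host: host, Path: "/" + bucket + "/" + object}
	return u.String()
}

// jsonAPIObjectURL builds a URL that includes the storage/v1 API prefix.
// The path shape is an assumption based on the public JSON API.
func jsonAPIObjectURL(host, bucket, object string) string {
	return fmt.Sprintf("http://%s/storage/v1/b/%s/o/%s?alt=media",
		host, url.PathEscape(bucket), url.PathEscape(object))
}

func main() {
	// Hypothetical emulator address, bucket and object.
	host, bucket, object := "localhost:4443", "my-bucket", "data.csv"

	fmt.Println(directObjectURL(host, bucket, object))
	fmt.Println(jsonAPIObjectURL(host, bucket, object))
}
```

Only the second form carries the `storage/v1` prefix, which is why it does not depend on the emulator's publicHost setting.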
Changes made in this PR:
I have enabled the option to use the JSON API, ensuring that imports work even when a publicHost is not set for the emulator. Since the JSON download API was introduced in v1.30.0, I have upgraded the `cloud.google.com/go/storage` version.
This might be more a problem with the Go GCS SDK than with the BigQuery Emulator. So, if this fix isn't right, please let me know. If that's the case, I'm thinking of making another PR to add guidelines in the README about setting a publicHost for the GCS Emulator.
Thank you for maintaining such a great product.